DISTRIBUTED OPTIMIZATION VIA ADAPTIVE REGULARIZATION FOR LARGE 
PROBLEMS WITH SEPARABLE CONSTRAINTS 
ABSTRACT

Many practical applications require solving an optimization
problem over large, high-dimensional data sets, which makes
these problems hard to solve and prohibitively time consuming.
In this paper, we propose a parallel distributed algorithm that
uses an adaptive regularizer (PDAR) to solve a joint optimization
problem with separable constraints. The regularizer is
adaptive and depends on the step size between iterations and
the iteration number. We show theoretical convergence of our
algorithm to an optimal solution, and use a multi-agent three-bin
resource allocation example to illustrate the effectiveness
of the proposed algorithm. Numerical simulations show that
our algorithm converges to the same optimal solution as other
distributed methods, with significantly reduced computational
time.

1. INTRODUCTION 

As sensor and storage technologies become increasingly cheap,
modern applications are seeing a sharp increase in big data.
The explosion of such high-dimensional and complex data sets
makes optimization problems extremely hard and prohibitively
time consuming [1]. Parallel computing has received significant
attention lately as an effective tool to achieve the high
throughput required for processing big data sets. Thus, there
has been a paradigm shift from aggregating multi-core processors
to utilizing them efficiently [2].

Although distributed optimization is an increasingly important
topic, it received little attention between the seminal work of
Bertsekas and Tsitsiklis and recent years. In the 1980s,
Bertsekas and Tsitsiklis extensively studied decentralized
detection and consensus problems [3] and developed algorithms
such as parallel coordinate descent [4] and block coordinate
descent (BCD) (also called block Jacobi) [3], [5]. In 1994,
Ferris et al. proposed parallel variable distribution (PVD) [6],
which alternates between a parallelization step and a
synchronization step. In the parallelization step, several
sub-optimal points are found using parallel optimizations.
Then, in the synchronization step, the optimal point is computed
by taking an optimal weighted average of the points found in the
parallel step. Although PVD claims to achieve a better
convergence rate than BCD, the complexity of solving an
optimization in both steps makes it impractical for
high-dimensional problems. There are other efficient distributed
methods in the literature, such as shooting [7], shotgun [2],
and the alternating direction method of multipliers (ADMM) [1];
however, these methods apply only to specific types of
optimization problems: $\ell_1$-regularization for shooting and
shotgun, and linear constraints for ADMM.

In this paper, we propose a fully distributed parallel
method to solve optimization problems over high-dimensional
data sets, which we call parallel distributive adaptive
regularization (PDAR). Our method can be applied to a wide
variety of nonlinear problems where the constraints are block
separable. The assumption of block separable constraints
holds in many practical problems, and is commonly seen in
problems such as multi-agent resource allocation. In order to
coordinate among the subproblems, we introduce an adaptive
regularizer term that penalizes large changes in successive
iterations. Our method can be seen as an extension of the
classical proximal point method (PPM) [8] with two novel
advances. First, our motivation for using the PPM framework is
very different from the original one. We use PPM as a means to
coordinate among the parallel subproblems and not for handling
nondifferentiability. Second, we enforce coordination by using
adaptive regularizers that vary across different subproblems.

The rest of the paper is organized as follows. In Section 2,
we formulate the problem; in Section 3, we propose our parallel
distributive algorithm and show that it converges to an optimum
solution; in Section 4, we provide numerical simulations; and
we conclude the paper in Section 5.



2. PROBLEM FORMULATION 

Consider an optimization problem given as:

$$\text{minimize} \quad f(x) \tag{1}$$
$$\text{subject to} \quad x \in \mathcal{X}, \tag{2}$$

where the objective is to find the optimal vector $x^*$ that
minimizes the function $f(x) \in \mathbb{R}$, with
$x \in \mathbb{R}^d$. The problem is often very complex,
nonlinear, and high dimensional, and solving it is prohibitively
time consuming. We assume that the constraint $x \in \mathcal{X}$
can be separated into several blocks, such that



$$x = [x_1, x_2, \ldots, x_i, \ldots, x_N], \quad \text{where } x_i \in \mathcal{X}_i, \tag{3}$$

with $\mathcal{X}_i \subseteq \mathbb{R}^{n_i}$ and
$\sum_{i=1}^{N} n_i = d$. Once the problem is separated into
blocks, distributed iterative approaches (such as the ones
mentioned in the Introduction) can be applied. However, these
methods are time consuming when the subproblems are themselves
complex.

Algorithm: PDAR

    k = 1;  % Iteration counter
    Initialize $x_i^0$ and $\lambda_i^0$, $\forall i$
    do
        parfor i in 1 : N
            $x_i^k = \arg\min_{x_i \in \mathcal{X}_i} L_i^k(x_i; x_{-i}^{k-1})$
            Set $h_i^k = x_i^k - x_i^{k-1}$
            Update $\lambda_i^k$
        end parfor
        k := k + 1
    until $\|f(x^k) - f(x^{k-1})\| < \delta$

Table 1: Algorithm for Parallel Distributed Optimization
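
The loop in Table 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy objective $f(x) = \frac{1}{2} x'Ax - b'x$ with scalar, unconstrained blocks, the closed-form block solver, the helper names (`objective`, `solve_block`, `pdar`, `example_lam_rule`), and every numeric constant are hypothetical choices made so that the example is self-contained. The inner list comprehension stands in for the `parfor`; each block reads only the previous iterate, so the updates could run on separate workers.

```python
def objective(x, A, b):
    """Toy convex objective f(x) = 0.5 * x'Ax - b'x (assumed for illustration)."""
    n = len(x)
    quad = sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))
    return 0.5 * quad - sum(b[i] * x[i] for i in range(n))


def solve_block(i, x_prev, lam, A, b):
    """Closed-form minimizer over x_i of f(x_i, x_{-i}^{k-1}) + lam * (x_i - x_i^{k-1})^2."""
    coupling = sum(A[i][j] * x_prev[j] for j in range(len(x_prev)) if j != i)
    return (b[i] - coupling + 2.0 * lam * x_prev[i]) / (A[i][i] + 2.0 * lam)


def pdar(A, b, x0, lam_rule, delta=1e-12, max_iter=500):
    """Repeat parallel block updates until the objective change drops below delta."""
    x_prev = list(x0)
    h_prev = [0.0] * len(x0)
    for k in range(1, max_iter + 1):
        lam = [lam_rule(h, k) for h in h_prev]
        # "parfor": every block reads only x_prev, so the updates are independent.
        x_new = [solve_block(i, x_prev, lam[i], A, b) for i in range(len(x0))]
        h_prev = [xn - xp for xn, xp in zip(x_new, x_prev)]
        change = abs(objective(x_new, A, b) - objective(x_prev, A, b))
        x_prev = x_new
        if change < delta:
            break
    return x_prev, k


def example_lam_rule(h, k, K=10, alpha=0.05, beta=0.1):
    """Hypothetical two-phase coefficient; all constants are illustrative only."""
    return max(2.0 * abs(h), beta) if k < K else alpha * k


A = [[2.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]
x_star, iterations = pdar(A, b, [0.0, 1.0], example_lam_rule)
print(x_star, iterations)
```

For this toy problem the unconstrained minimizer is $x = [1/3, 1/3]$, which gives a simple reference point for checking the iterates.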



3. DISTRIBUTED OPTIMIZATION VIA ADAPTIVE 
REGULARIZATION 

In this section, we describe our distributed optimization
framework with adaptive regularization. We solve the
optimization problem given by Eq. (1) in a parallel and iterative
manner. Let $k$ denote the iteration index and let
$x^k = (x_i^k, x_{-i}^k)$, with
$x_{-i}^k = [x_1^k, \ldots, x_{i-1}^k, x_{i+1}^k, \ldots, x_N^k]$,
denote the solution to the optimization problem in the $k$th
iteration. In order to obtain a solution in a distributed manner,
we define a set of $N$ augmented objective functions at each
iteration $k$ as$^1$

$$L_i^k(x_i; x_{-i}^{k-1}) = f(x_i, x_{-i}^{k-1}) + \lambda_i^k(h_i^{k-1}) \, \|x_i - x_i^{k-1}\|^2, \tag{4}$$

where $h_i^{k-1} = x_i^{k-1} - x_i^{k-2}$ is the step taken by the
$i$th block in the $(k-1)$th iteration, and
$\lambda_i^k(h_i^{k-1})$ is an adaptive regularization coefficient
which depends on both the indices $i$ and $k$. We will describe
the form of this regularization coefficient shortly. After
defining the objective functions $L_i^k(\cdot)$,
$i = 1, \ldots, N$, we solve $N$ optimization problems in a
parallel fashion:



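$$\begin{aligned}
x_1^k &= \arg\min_{x_1 \in \mathcal{X}_1} L_1^k(x_1; x_{-1}^{k-1}), \\
x_2^k &= \arg\min_{x_2 \in \mathcal{X}_2} L_2^k(x_2; x_{-2}^{k-1}), \\
&\;\;\vdots \\
x_N^k &= \arg\min_{x_N \in \mathcal{X}_N} L_N^k(x_N; x_{-N}^{k-1}).
\end{aligned} \tag{5}$$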



Because the fixed blocks $x_{-i}^{k-1}$ are updated after every
iteration, the objective functions $L_i^k(\cdot)$ change in every
iteration.
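
As a concrete reading of the augmented objective, the following sketch evaluates $f$ with block $i$ replaced by a trial value, holds all other blocks at their previous values, and adds the quadratic penalty on the change of block $i$. The function name `augmented_objective`, the representation of $x$ as a list of blocks, and the toy function `f_example` are assumptions made for illustration.

```python
def augmented_objective(f, i, x_i, x_prev, lam):
    """Evaluate f(x_i, x_{-i}^{k-1}) + lam * ||x_i - x_i^{k-1}||^2.

    x_prev is the previous iterate as a list of blocks (each block a list of floats);
    only block i is replaced by the trial value x_i.
    """
    x_trial = [list(block) for block in x_prev]
    x_trial[i] = list(x_i)
    penalty = sum((a - c) ** 2 for a, c in zip(x_i, x_prev[i]))
    return f(x_trial) + lam * penalty


def f_example(x):
    """Hypothetical joint objective: squared deviation of the total from 1."""
    total = sum(v for block in x for v in block)
    return (total - 1.0) ** 2


x_prev = [[0.2, 0.1], [0.4]]
unchanged = augmented_objective(f_example, 0, [0.2, 0.1], x_prev, lam=2.0)
moved = augmented_objective(f_example, 0, [0.5, 0.1], x_prev, lam=2.0)
print(unchanged, moved)
```

When the trial block equals its previous value the penalty vanishes and the augmented objective equals $f$ at the previous iterate; any move away from it is charged in proportion to the coefficient.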

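Next, we discuss the choice of the regularization coefficient
$\lambda_i^k(h_i^{k-1})$. We chose $\lambda_i^k(h_i^{k-1})$ to be
of the form:

$$\lambda_i^k(h_i^{k-1}) = \begin{cases} \max\big(\phi(\|h_i^{k-1}\|), \beta\big) & \text{if } k < K \\ \alpha k & \text{otherwise,} \end{cases} \tag{6}$$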



where $K$ is a threshold on the iteration index, and $\alpha > 0$
and $\beta > 0$ are parameters chosen depending on the problem.
Intuitively, the threshold $K$ divides each optimization problem
into two phases. The goal of the first phase is to coordinate
the parallel optimization. In this phase, each agent changes its
solution in response to the solutions of the other agents. This
alternating behavior can be enforced by choosing the function
$\phi(\|h_i^{k-1}\|)$ to be nondecreasing with respect to
$\|h_i^{k-1}\|$. This choice increases the value of the
regularization coefficient $\lambda_i^k(h_i^{k-1})$ as
$\|h_i^{k-1}\|$ increases. The increase in
$\lambda_i^k(h_i^{k-1})$ in turn enforces a smaller step size on
the agents that had a large change in the previous iteration,
allowing other agents to react in the current iteration. The goal
of the second phase is to fine-tune the solution and to enable it
to reach a local optimum. Although the choice of the function
$\phi$ and the parameters $\alpha$ and $\beta$ theoretically
affects the convergence of the optimization, we observed in
numerical simulations that the convergence was not sensitive to
these choices. In this
paper we choose $\phi(\|h_i^{k-1}\|) = 2\sqrt{\|h_i^{k-1}\|}$.
The algorithm is summarized in Table 1.
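
A two-phase coefficient of the kind described above might be coded as follows. This is only a sketch: the floor `beta` inside the `max`, the placeholder function passed as `phi`, the function name `reg_coefficient`, and all numeric values are assumptions chosen for illustration, and any nondecreasing `phi` could be substituted.

```python
import math


def reg_coefficient(h_prev, k, K, alpha, beta, phi):
    """Adaptive coefficient for one block.

    Phase 1 (k < K): grows with the size of the block's previous step, never below beta.
    Phase 2 (k >= K): grows linearly with the iteration index, alpha * k.
    """
    if k < K:
        step_norm = math.sqrt(sum(v * v for v in h_prev))
        return max(phi(step_norm), beta)
    return alpha * k


def placeholder_phi(s):
    """Assumed nondecreasing function of the step norm (illustrative only)."""
    return 2.0 * math.sqrt(s)


large_step = reg_coefficient([3.0, 4.0], k=5, K=20, alpha=0.5, beta=0.1, phi=placeholder_phi)
no_step = reg_coefficient([0.0, 0.0], k=5, K=20, alpha=0.5, beta=0.1, phi=placeholder_phi)
late = reg_coefficient([3.0, 4.0], k=30, K=20, alpha=0.5, beta=0.1, phi=placeholder_phi)
print(large_step, no_step, late)
```

A block that moved a lot in the previous iteration receives a larger coefficient and is therefore held closer to its current value, which gives the other blocks room to react.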



This optimization framework is in the form of a
decomposition-coordination procedure [2], where $N$ agents are
trying to minimize their own augmented objective functions, and
the new joint vector $x^k$ is obtained by simply aggregating the
$N$ blocks. Further, the minimization of each objective function
$L_i^k(\cdot)$ is only with respect to the variables of the $i$th
block, while the other blocks enter as constants taken from the
previous iterate.


1 We use a semicolon notation in Eq. (4) to clarify that only the variables
on the left of the semicolon are allowed to change.



3.1. Discussion on the Convergence 

In this section, we show that the algorithm described in the
previous subsection converges to an optimum solution. Assume
that the function $f(x)$ is convex. Since each augmented
function $L_i^k(\cdot)$, $i = 1, \ldots, N$, is the sum of two
convex functions, it is convex. We then have

$$x_i^k = \arg\min_{x_i \in \mathcal{X}_i} L_i^k(x_i; x_{-i}^{k-1}). \tag{7}$$



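Since $x_i^k$ is a minimizer of $L_i^k(x_i; x_{-i}^{k-1})$, we
have by the first order necessary conditions for a local optimum
that

$$\begin{aligned}
\nabla_i L_i^k(x_i^k; x_{-i}^{k-1}) &= 0, \\
\nabla_i f(x_i^k, x_{-i}^{k-1}) + 2\lambda_i^k(h_i^{k-1})\,(x_i^k - x_i^{k-1}) &= 0, \\
\nabla_i f(x_i^k, x_{-i}^{k-1}) &= -2\lambda_i^k(h_i^{k-1})\, h_i^k, \\
h_i^k &= -\frac{1}{2\lambda_i^k(h_i^{k-1})} \nabla_i f(x_i^k, x_{-i}^{k-1}),
\end{aligned} \tag{8}$$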



where the operator $\nabla_i$ is the gradient operator with
respect to $x_i$. For $k \ge K$ we have
$\lambda_i^k(h_i^{k-1}) = \alpha k$, and therefore Eq. (8)
simplifies to

$$h_i^k = -\frac{1}{2\alpha k} \nabla_i f(x_i^k, x_{-i}^{k-1}) = \frac{1}{2\alpha k}\, d_i^k, \tag{9}$$

where $d_i^k$ is the negative gradient direction of the $i$th
agent. By concatenating all the directions into a single vector
$d^k = [d_1^k, d_2^k, \ldots, d_N^k]$, we get the next iterate
$x^k$ as

$$x^k = x^{k-1} + \eta^k d^k, \tag{10}$$

where $\eta^k = \frac{1}{2\alpha k}$. We prove the convergence
properties of the algorithm using the following two propositions.
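
The step relation derived from the first order conditions can be checked numerically on a one-dimensional block. The sketch below assumes a hypothetical scalar objective $f(x) = \frac{1}{2} a x^2 + c x$ and a coefficient equal to $\alpha k$; the names `regularized_block_step` and `grad_f` and all constants are illustrative. It solves the regularized subproblem in closed form and compares the resulting step with the gradient at the new point divided by $-2\alpha k$.

```python
def regularized_block_step(a, c, x_prev, lam):
    """Minimize 0.5*a*x**2 + c*x + lam*(x - x_prev)**2 over scalar x.

    Setting the derivative a*x + c + 2*lam*(x - x_prev) to zero gives the minimizer.
    Returns the new point and the step h = x_new - x_prev.
    """
    x_new = (2.0 * lam * x_prev - c) / (a + 2.0 * lam)
    return x_new, x_new - x_prev


def grad_f(a, c, x):
    """Derivative of the assumed objective 0.5*a*x**2 + c*x."""
    return a * x + c


alpha, k = 0.5, 4              # coefficient alpha * k = 2 in the second phase
a, c, x_prev = 3.0, -2.0, 1.0
x_new, h = regularized_block_step(a, c, x_prev, alpha * k)
implicit_step = -grad_f(a, c, x_new) / (2.0 * alpha * k)
print(x_new, h, implicit_step)
```

The two quantities agree, which illustrates that the regularized update behaves like a gradient step evaluated at the new point with step size $1/(2\alpha k)$.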

Proposition 1: For the sequence of non-stationary iterates $x^k$
obtained from the PDAR algorithm, $\nabla f(x^{k-1})' d^k < 0$.$^2$
Proof: From the definition of $d^k$, we have

$$d_i^k = -\nabla_i f(x_i^k, x_{-i}^{k-1}). \tag{11}$$

Therefore,

$$\nabla f(x^{k-1})' d^k = \sum_{i=1}^{N} -\nabla_i f(x^{k-1})' \, \nabla_i f(x_i^k, x_{-i}^{k-1}). \tag{12}$$

Since $x_i^k$ is the result of minimizing
$L_i^k(x_i; x_{-i}^{k-1})$, the corresponding step $h_i^k$ must
be in a descent direction. Thus

$$\nabla_i L_i^k(x^{k-1})' h_i^k = \nabla_i f(x^{k-1})' h_i^k \le 0, \quad \forall i. \tag{13}$$



However, there must exist at least one block where the strict
inequality $\nabla_i f(x^{k-1})' h_i^k < 0$ holds. We prove this
by contradiction. Assume that $\nabla_i f(x^{k-1})' h_i^k = 0$,
$\forall i$. If $h_i^k = 0$, $\forall i$, then $x^k$ is a
stationary point, which contradicts the assumption that the
iterates are nonstationary. Hence there exists some $i$ for which
$h_i^k \neq 0$. Now, since $L_i^k(x_i; x_{-i}^{k-1})$ is a convex
function, it must lie above all of its tangents, i.e.,

$$L_i^k(x_i^k; x_{-i}^{k-1}) \ge L_i^k(x^{k-1}) + \nabla_i L_i^k(x^{k-1})' h_i^k. \tag{14}$$

Since $\nabla_i L_i^k(x^{k-1})' h_i^k = \nabla_i f(x^{k-1})' h_i^k = 0$,
we have from Eq. (14) that
$L_i^k(x_i^k; x_{-i}^{k-1}) \ge L_i^k(x^{k-1})$. This is a
contradiction, since every iterate should reduce the objective
function corresponding to the block. Intuitively, this inequality
implies that if the step is perpendicular to the gradient of the
objective function, then such a step does not decrease the value
of the objective function. Hence there exists at least one block
that satisfies the inequality $\nabla_i f(x^{k-1})' h_i^k < 0$.
Finally, since at least one block satisfies the strict
inequality, the summation satisfies the strict inequality:

$$\sum_{i=1}^{N} \nabla_i f(x^{k-1})' h_i^k < 0 \;\Longrightarrow\; \nabla f(x^{k-1})' d^k < 0.$$

2 For brevity, if all the blocks in the function are from the same iteration,
we will simplify the notation, i.e., $f(x_i^{k-1}, x_{-i}^{k-1}) = f(x^{k-1})$.
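
The descent property in Proposition 1 can be illustrated on a small example. The sketch below is a numerical check on one instance, not a proof: it assumes a hypothetical convex quadratic $f(x) = \frac{1}{2} x'Ax - b'x$ with two scalar blocks and a fixed coefficient, performs one parallel regularized update, builds the direction from the block gradients evaluated at the mixed points, and evaluates its inner product with the gradient at the previous iterate.

```python
def grad(A, b, x):
    """Gradient of f(x) = 0.5 * x'Ax - b'x for symmetric A (toy objective)."""
    n = len(x)
    return [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]


def pdar_directions(A, b, x_prev, lam):
    """One parallel regularized update on scalar blocks.

    Returns (x_new, d) where d[i] is minus the partial derivative of f with respect
    to block i, evaluated with block i at its new value and the others at x_prev.
    """
    n = len(x_prev)
    x_new, d = [], []
    for i in range(n):
        coupling = sum(A[i][j] * x_prev[j] for j in range(n) if j != i)
        x_i = (b[i] - coupling + 2.0 * lam * x_prev[i]) / (A[i][i] + 2.0 * lam)
        x_new.append(x_i)
        d.append(-(A[i][i] * x_i + coupling - b[i]))
    return x_new, d


A = [[2.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]
x_prev = [1.0, -0.5]
x_new, d = pdar_directions(A, b, x_prev, lam=1.0)
inner = sum(g * di for g, di in zip(grad(A, b, x_prev), d))
print(x_new, d, inner)
```

Here the inner product is negative, so the concatenated direction is a descent direction for $f$ at the previous iterate, as the proposition states for nonstationary points.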

Proposition 2: The sequence $x^k$ converges to an optimal
solution.

Proof: Assume that $f$ has Lipschitz continuous gradients, and
that its gradients are bounded. Formally, we need to show that
for any subsequence $\{x^k\}_{k \in \mathcal{K}}$ that converges
to a nonstationary point, the corresponding subsequence
$\{d^k\}_{k \in \mathcal{K}}$ is bounded and satisfies [8]:

$$\limsup_{k \to \infty,\, k \in \mathcal{K}} \nabla f(x^{k-1})' d(x^k) < 0, \tag{15}$$

where $d(x^k) = -\sum_{i=1}^{N} \nabla_i f(x_i^k, x_{-i}^{k-1})$.
Let $\epsilon > 0$, and let $\{x^k\}_{k \in \mathcal{K}}$ be an
arbitrary sequence of nonstationary points such that

$$\limsup_{k \in \mathcal{K}} x^k = \bar{x},$$

where $\nabla f(\bar{x}) \neq 0$. Then $\forall k \in \mathcal{K}$,
the gradients are not equal to zero, $\nabla f(x^k) \neq 0$,
since the sequence consists of nonstationary points. Using
Proposition 1, we have that $\forall k \in \mathcal{K}$,
$\nabla f(x^{k-1})' d(x^k) < 0$, and specifically
$\nabla f(\bar{x})' d(\bar{x}) = D_1 < 0$.

By the Lipschitz continuity assumption on the gradients,
$\exists\, \delta > 0$ such that
$\|\nabla f(y)' d(y) - \nabla f(\bar{x})' d(\bar{x})\| < \epsilon$,
$\forall\, \|y - \bar{x}\| < \delta$. Since $x^k \to \bar{x}$,
$\exists\, N \in \mathbb{N}$ such that $\forall k > N$,
$\|x^k - \bar{x}\| < \delta$, and thus

$$\|\nabla f(x^{k-1})' d(x^k) - \nabla f(\bar{x})' d(\bar{x})\| < \epsilon.$$

This implies that $\nabla f(x^{k-1})' d(x^k) < D_1 + \epsilon$.
As $\epsilon > 0$ is arbitrary,
$\limsup_{k \to \infty,\, k \in \mathcal{K}} \nabla f(x^{k-1})' d(x^k) = D_1 < 0$.
Hence the sequence of iterates $x^k$ converges to an optimal
solution.



4. NUMERICAL RESULTS 

In this section, we provide numerical results to compare the
convergence of the proposed distributed algorithm to those of
block coordinate descent (BCD) and parallel variable
distribution (PVD). We consider a three-bin resource allocation
example for the numerical simulation. Let there be $N = 100$
agents. Each agent has a fixed quantity of resources that are to
be allocated among three bins. Let
$x_i = [x_{i,1}, x_{i,2}, x_{i,3}]'$ denote the allocation scheme
of the $i$th agent. Without loss of generality, let
$\sum_{j=1}^{3} x_{i,j} = 1$, $\forall i$. The objective is to
minimize the sum of the individual costs, where the cost of each
agent depends on its own scheme and the schemes of the other
agents.

Let $x = [x_1', x_2', \ldots, x_N']'$ denote the collective scheme
of all agents. The cost function of the $i$th agent is taken as

$$f_i(x) = x_i' P_i \, g(x), \tag{16}$$

where $P_i = \mathrm{diag}(p_{i,1}, p_{i,2}, p_{i,3})$ denotes the
preference matrix of the $i$th agent for each bin, and
$g(x) = [g_1, g_2, g_3]'$ is a function dependent on the schemes
of all agents, with

$$g_m = \left( \sum_{i=1}^{N} x_{i,m} \right)^2, \quad m \in \{1, 2, 3\}. \tag{17}$$

The goal is to solve the optimization problem:

$$\min_{x} \sum_{i=1}^{N} f_i(x) \quad \text{subject to} \quad \sum_{j=1}^{3} x_{i,j} = 1, \ \forall i. \tag{18}$$
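
The three-bin cost can be written out directly. The sketch below assumes that the load term for each bin is the squared total allocation of all agents to that bin, and that each agent's cost weights its own allocation by its preferences and by that load; the helper names (`congestion`, `agent_cost`, `total_cost`) and the two-agent data are hypothetical and chosen only so the numbers are easy to follow.

```python
def congestion(x):
    """Per-bin load term: square of the total allocation of all agents to each bin."""
    return [sum(x_i[m] for x_i in x) ** 2 for m in range(3)]


def agent_cost(i, x, P):
    """Cost of agent i: its allocation weighted by its preferences and the bin loads."""
    g = congestion(x)
    return sum(x[i][m] * P[i][m] * g[m] for m in range(3))


def total_cost(x, P):
    """Joint objective: sum of the individual agent costs."""
    return sum(agent_cost(i, x, P) for i in range(len(x)))


# Two illustrative agents; each allocation sums to 1.
x = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
P = [[1.0, 2.0, 3.0], [1.0, 1.0, 1.0]]
print(congestion(x), agent_cost(0, x, P), agent_cost(1, x, P), total_cost(x, P))
```

Because the load terms depend on every agent's allocation, each agent's cost changes whenever any other agent moves, which is the coupling that the regularizer has to coordinate.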

In order to find the solution to the above joint optimization
problem, we solved $N = 100$ subproblems in parallel using our
proposed PDAR. The optimization problem of the $i$th agent in the
$k$th iteration is given as

$$\begin{aligned} \min_{x_i} \quad & f_i(x_i, x_{-i}^{k-1}) + \lambda_i^k(h_i^{k-1}) \, \|x_i - x_i^{k-1}\|^2 \\ \text{subject to} \quad & \sum_{j=1}^{3} x_{i,j} = 1. \end{aligned} \tag{19}$$
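
One way a single agent's subproblem might be solved is projected gradient descent with backtracking. The sketch below makes several assumptions that are not stated in the text: allocations are additionally required to be nonnegative (so the feasible set is the probability simplex), the other agents' allocations enter only through their per-bin totals `others`, and the solver, its name `solve_agent`, and all numeric values are illustrative rather than the method used in the reported experiments.

```python
def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1} (sort-based method)."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0:
            theta = t
    return [max(v_m - theta, 0.0) for v_m in v]


def objective(x_i, others, p_i, x_prev, lam):
    """Agent cost with loads (others[m] + x_i[m])^2, plus lam * ||x_i - x_prev||^2."""
    cost = sum(x * p * (s + x) ** 2 for x, p, s in zip(x_i, p_i, others))
    return cost + lam * sum((x - xp) ** 2 for x, xp in zip(x_i, x_prev))


def gradient(x_i, others, p_i, x_prev, lam):
    """Gradient of the regularized agent objective with respect to x_i."""
    return [p * ((s + x) ** 2 + 2.0 * x * (s + x)) + 2.0 * lam * (x - xp)
            for x, p, s, xp in zip(x_i, p_i, others, x_prev)]


def solve_agent(others, p_i, x_prev, lam, max_iter=200):
    """Projected gradient with backtracking; only objective-decreasing steps are kept."""
    x = list(x_prev)
    best = objective(x, others, p_i, x_prev, lam)
    for _ in range(max_iter):
        g = gradient(x, others, p_i, x_prev, lam)
        step, improved = 1.0, False
        while step > 1e-12 and not improved:
            trial = project_simplex([x_m - step * g_m for x_m, g_m in zip(x, g)])
            value = objective(trial, others, p_i, x_prev, lam)
            if value < best:
                x, best, improved = trial, value, True
            step *= 0.5
        if not improved:
            break  # no improving step found: treat the point as converged
    return x


others = [10.0, 5.0, 1.0]          # assumed totals of the other agents per bin
p_i = [1.0, 1.0, 1.0]
x_prev = [0.2, 0.3, 0.5]
x_i = solve_agent(others, p_i, x_prev, lam=1.0)
print(x_i)
```

In this instance the agent shifts its allocation toward the least loaded bin, while the penalty term charges it for moving away from its previous allocation.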

In Fig. 1a we plot the value of the objective function as a
function of the normalized time for BCD, PVD, and our PDAR
approach. We ran all the simulations on a 4-core machine.
However, in principle the parallel methods can be run on 100
cores simultaneously. Hence, in order to make the comparison
fair, the time axis corresponding to the parallel methods was
divided by 25 (100 subproblems spread over 4 cores). As
illustrated, our method converges an order of magnitude faster
than the BCD and PVD algorithms. The advantage comes from the
fact that we can solve all 100 optimization problems in parallel,
whereas BCD is a sequential method. The PVD method, on the other
hand, is worse even though it has a parallel update step. The
additional time it takes to converge is due to the
synchronization step, and due to the complexity of the
optimization problems that are to be solved in both steps. In
Fig. 1b we show the oscillatory behavior when the parallel
algorithm is used without a regularizer. This figure further
emphasizes the importance of a regularizer.
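
The role of the regularizer can be seen on a toy problem that is much simpler than the experiment above. The sketch assumes two agents with scalar variables and the hypothetical joint objective $(x_1 + x_2 - 1)^2$; with the coefficient set to zero each agent plays a best response to the other's previous value, and with a constant positive coefficient (an illustrative simplification of the adaptive rule) each agent is also penalized for moving.

```python
def f(x):
    """Toy joint objective for two scalar agents (assumed for illustration)."""
    return (x[0] + x[1] - 1.0) ** 2


def parallel_step(x, lam):
    """Both agents simultaneously minimize (x_i + x_j_old - 1)^2 + lam*(x_i - x_i_old)^2."""
    x0 = (1.0 - x[1] + lam * x[0]) / (1.0 + lam)
    x1 = (1.0 - x[0] + lam * x[1]) / (1.0 + lam)
    return [x0, x1]


def run(x, lam, steps):
    """Return the list of iterates produced by repeated parallel steps."""
    history = [list(x)]
    for _ in range(steps):
        x = parallel_step(x, lam)
        history.append(x)
    return history


plain = run([0.0, 0.0], lam=0.0, steps=4)          # no regularizer
regularized = run([0.0, 0.0], lam=0.5, steps=50)   # constant regularizer
print(plain)
print(regularized[-1], f(regularized[-1]))
```

Without the penalty the two agents overshoot together and the iterates alternate between two points without reducing the objective; with the penalty the deviation of the sum from 1 shrinks by a constant factor at every step.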




Fig. 1: Value of the objective function vs. time (s) for the
three-bin resource allocation problem. Fig. 1a shows that PDAR
converges much faster than BCD and PVD. Fig. 1b shows the
oscillatory behavior of the parallel optimization without
regularization.



5. CONCLUSIONS 

In this paper, we proposed a distributed optimization framework
to solve large optimization problems with separable
constraints. Each agent solves a local optimization problem,
which is much simpler than the joint optimization. In order for
the agents to coordinate among themselves and to reach an
optimum solution, we introduced a regularization term that
penalizes the changes in successive iterations with an adaptive
regularization coefficient. We proved that our solution always
converges to a local optimum, and to a global optimum if the
overall objective function is convex. Numerical simulations
showed that the solutions reached by our algorithm are the same
as the ones obtained using other distributed approaches, with
significantly reduced computation time.